Tamazight exploration server

Warning: this site is under development.
It was generated automatically from raw corpora, so the information has not been validated.

A massively parallel corpus: the Bible in 100 languages

Internal identifier: 000083 (Main/Exploration); previous: 000082; next: 000084

Authors: Christos Christodouloupoulos [United States]; Mark Steedman [United Kingdom]

RBID : PMC:4551210

Abstract

We describe the creation of a massively parallel corpus based on 100 translations of the Bible. We discuss some of the difficulties in acquiring and processing the raw material as well as the potential of the Bible as a corpus for natural language processing. Finally we present a statistical analysis of the corpora collected and a detailed comparison between the English translation and other English corpora.


DOI: 10.1007/s10579-014-9287-y
PubMed: 26321896
PubMed Central: 4551210


The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">A massively parallel corpus: the Bible in 100 languages</title>
<author>
<name sortKey="Christodouloupoulos, Christos" sort="Christodouloupoulos, Christos" uniqKey="Christodouloupoulos C" first="Christos" last="Christodouloupoulos">Christos Christodouloupoulos</name>
<affiliation wicri:level="2">
<nlm:aff id="Aff1">Department of Computer Science, UIUC, 201 N. Goodwin Ave, Urbana, IL 61801 USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">Illinois</region>
</placeName>
<wicri:cityArea>Department of Computer Science, UIUC, 201 N. Goodwin Ave, Urbana</wicri:cityArea>
</affiliation>
</author>
<author>
<name sortKey="Steedman, Mark" sort="Steedman, Mark" uniqKey="Steedman M" first="Mark" last="Steedman">Mark Steedman</name>
<affiliation wicri:level="4">
<nlm:aff id="Aff2">School of Informatics, University of Edinburgh, Edinburgh, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>School of Informatics, University of Edinburgh, Edinburgh</wicri:regionArea>
<placeName>
<settlement type="city">Édimbourg</settlement>
<region type="country">Écosse</region>
</placeName>
<orgName type="university">Université d'Édimbourg</orgName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">26321896</idno>
<idno type="pmc">4551210</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4551210</idno>
<idno type="RBID">PMC:4551210</idno>
<idno type="doi">10.1007/s10579-014-9287-y</idno>
<date when="2014">2014</date>
<idno type="wicri:Area/Pmc/Corpus">000067</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000067</idno>
<idno type="wicri:Area/Pmc/Curation">000066</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Curation">000066</idno>
<idno type="wicri:Area/Pmc/Checkpoint">000076</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Checkpoint">000076</idno>
<idno type="wicri:Area/Ncbi/Merge">000197</idno>
<idno type="wicri:Area/Ncbi/Curation">000197</idno>
<idno type="wicri:Area/Ncbi/Checkpoint">000197</idno>
<idno type="wicri:doubleKey">1574-020X:2014:Christodouloupoulos C:a:massively:parallel</idno>
<idno type="wicri:Area/Main/Merge">000083</idno>
<idno type="wicri:Area/Main/Curation">000083</idno>
<idno type="wicri:Area/Main/Exploration">000083</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">A massively parallel corpus: the Bible in 100 languages</title>
<author>
<name sortKey="Christodouloupoulos, Christos" sort="Christodouloupoulos, Christos" uniqKey="Christodouloupoulos C" first="Christos" last="Christodouloupoulos">Christos Christodouloupoulos</name>
<affiliation wicri:level="2">
<nlm:aff id="Aff1">Department of Computer Science, UIUC, 201 N. Goodwin Ave, Urbana, IL 61801 USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">Illinois</region>
</placeName>
<wicri:cityArea>Department of Computer Science, UIUC, 201 N. Goodwin Ave, Urbana</wicri:cityArea>
</affiliation>
</author>
<author>
<name sortKey="Steedman, Mark" sort="Steedman, Mark" uniqKey="Steedman M" first="Mark" last="Steedman">Mark Steedman</name>
<affiliation wicri:level="4">
<nlm:aff id="Aff2">School of Informatics, University of Edinburgh, Edinburgh, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>School of Informatics, University of Edinburgh, Edinburgh</wicri:regionArea>
<placeName>
<settlement type="city">Édimbourg</settlement>
<region type="country">Écosse</region>
</placeName>
<orgName type="university">Université d'Édimbourg</orgName>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Language Resources and Evaluation</title>
<idno type="ISSN">1574-020X</idno>
<idno type="eISSN">1574-0218</idno>
<imprint>
<date when="2014">2014</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>We describe the creation of a massively parallel corpus based on 100 translations of the Bible. We discuss some of the difficulties in acquiring and processing the raw material as well as the potential of the Bible as a corpus for natural language processing. Finally we present a statistical analysis of the corpora collected and a detailed comparison between the English translation and other English corpora.</p>
</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Royaume-Uni</li>
<li>États-Unis</li>
</country>
<region>
<li>Illinois</li>
<li>Écosse</li>
</region>
<settlement>
<li>Édimbourg</li>
</settlement>
<orgName>
<li>Université d'Édimbourg</li>
</orgName>
</list>
<tree>
<country name="États-Unis">
<region name="Illinois">
<name sortKey="Christodouloupoulos, Christos" sort="Christodouloupoulos, Christos" uniqKey="Christodouloupoulos C" first="Christos" last="Christodouloupoulos">Christos Christodouloupoulos</name>
</region>
</country>
<country name="Royaume-Uni">
<region name="Écosse">
<name sortKey="Steedman, Mark" sort="Steedman, Mark" uniqKey="Steedman M" first="Mark" last="Steedman">Mark Steedman</name>
</region>
</country>
</tree>
</affiliations>
</record>
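
As shown, the record uses the `wicri:` and `nlm:` prefixes without declaring them, so a strict XML parser is likely to reject it with an "unbound prefix" error. The following is a minimal sketch of one way to read it in Python, assuming placeholder namespace URIs (`urn:example:wicri`, `urn:example:nlm`) that are not taken from the source; the helper names `parse_record` and `summarize` and the shortened `SAMPLE` record are likewise illustrative.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs: assumptions for illustration, not taken from the record.
ASSUMED_NS = {
    "wicri": "urn:example:wicri",
    "nlm": "urn:example:nlm",
}


def parse_record(xml_text):
    """Parse a <record> whose wicri:/nlm: prefixes are undeclared."""
    decls = " ".join(f'xmlns:{p}="{u}"' for p, u in ASSUMED_NS.items())
    # Declare the prefixes on the root element so ElementTree accepts them.
    patched = xml_text.replace("<record>", f"<record {decls}>", 1)
    return ET.fromstring(patched)


def summarize(record):
    """Return the title, (author, country) pairs and main identifiers."""
    file_desc = record.find("./TEI/teiHeader/fileDesc")
    title_stmt = file_desc.find("titleStmt")
    pub_stmt = file_desc.find("publicationStmt")
    authors = [
        (a.findtext("name"), a.findtext("affiliation/country"))
        for a in title_stmt.findall("author")
    ]
    ids = {
        i.get("type"): i.text
        for i in pub_stmt.findall("idno")
        if i.get("type") in ("doi", "pmid", "pmc")
    }
    return {"title": title_stmt.findtext("title"), "authors": authors, "ids": ids}


# Shortened copy of the record above, kept only to make the sketch self-contained.
SAMPLE = """<record><TEI><teiHeader><fileDesc>
<titleStmt>
<title xml:lang="en">A massively parallel corpus: the Bible in 100 languages</title>
<author><name>Christos Christodouloupoulos</name>
<affiliation wicri:level="2"><country xml:lang="fr">États-Unis</country></affiliation></author>
<author><name>Mark Steedman</name>
<affiliation wicri:level="4"><country xml:lang="fr">Royaume-Uni</country></affiliation></author>
</titleStmt>
<publicationStmt>
<idno type="pmid">26321896</idno>
<idno type="pmc">4551210</idno>
<idno type="doi">10.1007/s10579-014-9287-y</idno>
</publicationStmt>
</fileDesc></teiHeader></TEI></record>"""

if __name__ == "__main__":
    print(summarize(parse_record(SAMPLE)))
```

Declaring the prefixes on the root element keeps the rest of the record untouched. The element paths stay unprefixed because only attributes and a few child elements carry the `wicri:` or `nlm:` prefix.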

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Linguistique/explor/TamazightV2/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000083 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000083 | SxmlIndent | more

To link to this page from the Wicri network

{{Explor lien
   |wiki=    Wicri/Linguistique
   |area=    TamazightV2
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     PMC:4551210
   |texte=   A massively parallel corpus: the Bible in 100 languages
}}
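
When many records need such a link, the template call above can be produced programmatically. The sketch below is only an illustration: the function name `explor_link` and its argument names are hypothetical, and it assumes that the template accepts its named parameters without the column alignment used in the hand-written call.

```python
def explor_link(wiki, area, flux, etape, rbid, texte):
    """Render an {{Explor lien}} call using the parameter names shown above."""
    params = [
        ("wiki", wiki),
        ("area", area),
        ("flux", flux),
        ("étape", etape),
        ("type", "RBID"),  # the call above identifies the record by its RBID
        ("clé", rbid),
        ("texte", texte),
    ]
    lines = ["{{Explor lien"]
    lines += [f"   |{name}= {value}" for name, value in params]
    lines.append("}}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(explor_link(
        wiki="Wicri/Linguistique",
        area="TamazightV2",
        flux="Main",
        etape="Exploration",
        rbid="PMC:4551210",
        texte="A massively parallel corpus: the Bible in 100 languages",
    ))
```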

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Main/Exploration/RBID.i   -Sk "pubmed:26321896" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd   \
       | NlmPubMed2Wicri -a TamazightV2 

This area was generated with Dilib version V0.6.33.
Data generation: Wed Nov 15 18:28:35 2017. Site generation: Sat Feb 10 16:46:27 2024